Ever heard of Mobike, ofo and oBike before? All three of them are bike rental services which employ the use of smartphone applications to create station-less and cashless bike rentals possible to users. By using such platforms, bike rentals can be automated and tracked.
Objective:
To predict the total number of bike rentals each day based on measurable variables such as weather conditions and date-time.
Data is first cleaned by removing columns that are not useful, converting variables to appropriate type and checked for missing values. Especially for target variable, count, the whole row of data has to be removed if there are any missing values as the target variable should not be manipulated.
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
## Skim summary statistics
## n obs: 731
## n variables: 16
##
## -- Variable type:factor ---------------------------------------------------------------------------------------------------
## variable missing complete n n_unique top_counts
## dteday 0 731 731 731 201: 1, 201: 1, 201: 1, 201: 1
## ordered
## FALSE
##
## -- Variable type:integer --------------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75
## casual 0 731 731 848.18 686.62 2 315.5 713 1096
## cnt 0 731 731 4504.35 1937.21 22 3152 4548 5956
## holiday 0 731 731 0.029 0.17 0 0 0 0
## instant 0 731 731 366 211.17 1 183.5 366 548.5
## mnth 0 731 731 6.52 3.45 1 4 7 10
## registered 0 731 731 3656.17 1560.26 20 2497 3662 4776.5
## season 0 731 731 2.5 1.11 1 2 3 3
## weathersit 0 731 731 1.4 0.54 1 1 1 2
## weekday 0 731 731 3 2 0 1 3 5
## workingday 0 731 731 0.68 0.47 0 0 1 1
## yr 0 731 731 0.5 0.5 0 0 1 1
## p100 hist
## 3410 <U+2587><U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581>
## 8714 <U+2582><U+2585><U+2585><U+2587><U+2587><U+2585><U+2585><U+2582>
## 1 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 731 <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587>
## 12 <U+2587><U+2585><U+2587><U+2583><U+2585><U+2587><U+2585><U+2587>
## 6946 <U+2581><U+2585><U+2585><U+2586><U+2587><U+2585><U+2583><U+2583>
## 4 <U+2587><U+2581><U+2587><U+2581><U+2581><U+2587><U+2581><U+2587>
## 3 <U+2587><U+2581><U+2581><U+2585><U+2581><U+2581><U+2581><U+2581>
## 6 <U+2587><U+2587><U+2587><U+2587><U+2581><U+2587><U+2587><U+2587>
## 1 <U+2583><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
## 1 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
##
## -- Variable type:numeric --------------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75 p100
## atemp 0 731 731 0.47 0.16 0.079 0.34 0.49 0.61 0.84
## hum 0 731 731 0.63 0.14 0 0.52 0.63 0.73 0.97
## temp 0 731 731 0.5 0.18 0.059 0.34 0.5 0.66 0.86
## windspeed 0 731 731 0.19 0.077 0.022 0.13 0.18 0.23 0.51
## hist
## <U+2581><U+2585><U+2587><U+2587><U+2587><U+2587><U+2586><U+2581>
## <U+2581><U+2581><U+2581><U+2583><U+2587><U+2587><U+2585><U+2582>
## <U+2581><U+2583><U+2587><U+2586><U+2586><U+2587><U+2587><U+2582>
## <U+2582><U+2586><U+2587><U+2586><U+2582><U+2581><U+2581><U+2581>
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "holiday" "weekday" "workingday" "weathersit" "temp"
## [11] "atemp" "hum" "windspeed" "casual" "registered"
## [16] "cnt"
data <- day[,-c(1,14:15)]
#convert to factor: season,yr,mnth,holiday,weekday,workingday,hr
str(data)## 'data.frame': 731 obs. of 13 variables:
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
data$season <- as.factor(data$season)
data$yr <- as.factor(data$yr)
data$mnth <- as.factor(data$mnth)
data$holiday <- as.factor(data$holiday)
data$weekday <- factor(data$weekday, levels = c(1,2,3,4,5,6,0),
labels = c("Monday","Tuesday","Wednesday", "Thursday","Friday","Saturday","Sunday"))
data$workingday <- as.factor(data$workingday)
#convert to time-based: dteday
data$dteday <- ymd(data$dteday)
#check for missing values
colSums(is.na(data))## dteday season yr mnth holiday weekday
## 0 0 0 0 0 0
## workingday weathersit temp atemp hum windspeed
## 0 0 0 0 0 0
## cnt
## 0
Data is explored with simple visualisation of variables by categories: Date-Time and Weather Conditions.
#by date-time
year_lab <- c(`0` = "2011",
`1` = "2012")
data %>%
ggplot(aes(x=year, y=count))+
geom_boxplot()+
labs(title="Bike Rentals by Year")+
scale_x_discrete(breaks=c(0,1), labels=c(2011,2012))data %>%
ggplot(aes(x=month, y=count))+
geom_col(aes(fill=month))+
labs(title="Bike Rentals by Month")data %>%
ggplot(aes(x=season, y=count))+
geom_boxplot(aes(fill=season), show.legend = F)+
labs(title="Bike Rentals by Season")+
scale_x_discrete(breaks=c(1,2,3,4), labels=c("Spring","Summer","Fall","Winter"))+
facet_wrap(~year, labeller = as_labeller(year_lab))#number of rentals doubled in 2012
#interestingly, spring has the lowest rental numbers
workday_lab <- c(`0` = "No Work Day",
`1` = "Work Day")
data %>%
ggplot(aes(x=weekday, y=count))+
geom_col(aes(fill=weekday))+
facet_wrap(~workingday, labeller=as_labeller(workday_lab))+
labs(title="Bike Rentals by Weekday")+
theme(axis.text.x = element_text(hjust=1,angle = 45))weather_lab <- c(`1` = "Clear Day",
`2` = "Cloudy Day",
`3` = "Rain/Snow")
data %>%
ggplot(aes(x=weekday, y=count))+
geom_col()+
facet_wrap(~weathersit, labeller=as_labeller(weather_lab))+
labs(title="Bike Rentals by Weather Situation")+
theme_light()+
theme(axis.text.x = element_text(hjust =1, angle = 45))Machine Learning algorithms are used to build models, namely linear regression and random forest. Data is processed according the requirements of each algorithm. The model is cross-validated by splitting the data into train, validation and test sets to carry out evaluation via regression metrics (R squared and RMSE).
Base model is created with all inputs using linear regression. It will form the baseline for subsequent comparison of models. Linear regression used to build 2 models with different inputs, so as to focus on different aspects of modelling.
#split data into train and test sets
set.seed(100)
split <- sample(nrow(data), nrow(data)*0.7)
train <- data[split,]
hold <- data[-split,]
set.seed(101)
split2 <- sample(nrow(hold), nrow(hold)*0.5)
valid <- hold[split2,]
test <- hold[-split2,]
#check dimensions of sets
dim(data)## [1] 731 12
## [1] 511 12
## [1] 110 12
## [1] 110 12
##
## Call:
## lm(formula = count ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3764.5 -345.5 58.2 454.5 2881.0
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2417.73 274.95 8.793 < 2e-16 ***
## season2 875.73 211.70 4.137 4.16e-05 ***
## season3 622.70 252.31 2.468 0.013931 *
## season4 1433.96 220.08 6.516 1.82e-10 ***
## year1 2065.95 70.63 29.250 < 2e-16 ***
## month2 39.75 170.92 0.233 0.816182
## month3 568.22 200.17 2.839 0.004721 **
## month4 359.42 296.90 1.211 0.226648
## month5 703.67 317.54 2.216 0.027155 *
## month6 442.64 333.99 1.325 0.185692
## month7 38.57 372.00 0.104 0.917453
## month8 418.34 361.69 1.157 0.247999
## month9 1108.92 310.11 3.576 0.000384 ***
## month10 528.49 285.96 1.848 0.065196 .
## month11 -87.42 276.98 -0.316 0.752429
## month12 -61.14 221.87 -0.276 0.783015
## holiday1 -674.70 224.47 -3.006 0.002788 **
## weekdayTuesday 42.10 129.34 0.325 0.744959
## weekdayWednesday 138.93 128.64 1.080 0.280671
## weekdayThursday 94.36 130.66 0.722 0.470561
## weekdayFriday 221.46 134.83 1.643 0.101136
## weekdaySaturday 148.56 133.33 1.114 0.265731
## weekdaySunday -156.60 128.89 -1.215 0.224967
## workingday1 NA NA NA NA
## weathersit -567.52 88.37 -6.422 3.22e-10 ***
## temp 3159.17 1509.23 2.093 0.036849 *
## atemp 1692.18 1564.79 1.081 0.280054
## humid -1647.26 368.26 -4.473 9.62e-06 ***
## windspd -3110.50 494.90 -6.285 7.33e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 772.9 on 483 degrees of freedom
## Multiple R-squared: 0.8517, Adjusted R-squared: 0.8434
## F-statistic: 102.7 on 27 and 483 DF, p-value: < 2.2e-16
## Warning in predict.lm(fit_base, valid): prediction from a rank-deficient
## fit may be misleading
## pred_base valid.count
## 245 4658.6684 4727
## 195 4298.3181 5084
## 580 7136.4956 7261
## 311 3695.1139 4035
## 695 5079.0197 2424
## 611 6680.9761 5810
## 723 3651.1547 1787
## 204 5471.9194 3285
## 615 6706.4789 6203
## 199 4317.4651 4458
## 319 3449.0964 4195
## 418 4344.9917 4773
## 429 4164.8387 3423
## 113 2701.0091 4036
## 41 1282.5796 1538
## 312 3822.9247 4205
## 697 4100.3736 3959
## 317 3530.1051 3717
## 517 7387.1097 7338
## 203 5465.8139 3387
## 708 4918.4981 5582
## 601 7017.2299 7765
## 702 4527.4067 4649
## 585 6414.0011 7273
## 57 1942.0977 1969
## 315 2572.9160 3368
## 154 4971.5249 5312
## 668 4313.3786 22
## 384 3428.0297 3292
## 149 4409.2423 4788
## 294 4467.0081 4304
## 151 5752.5081 3982
## 706 5246.8802 5375
## 715 5632.0520 5047
## 284 4224.3734 4563
## 479 3902.6284 3214
## 452 5897.8566 5102
## 242 4666.8476 5204
## 70 1793.7724 1977
## 29 1452.8368 1098
## 549 6890.9933 6227
## 43 1606.6884 1472
## 505 7486.1834 8294
## 455 5432.3238 5459
## 678 5416.9158 5315
## 134 3563.8571 3409
## 275 2805.0003 2918
## 152 4614.6465 3974
## 96 3521.0655 2808
## 681 5666.8000 6852
## 10 1092.0868 1321
## 602 6687.9923 7582
## 217 4855.7360 4866
## 103 2537.1257 2162
## 493 5706.4107 6273
## 266 5009.3035 2395
## 414 4326.6253 4318
## 365 2234.4421 2485
## 511 7259.3364 6734
## 514 6496.0449 6043
## 500 5707.6101 2843
## 525 7202.2055 7736
## 484 5350.3622 4220
## 369 2771.6942 2368
## 63 1693.6487 1944
## 575 6794.5243 6685
## 109 3397.6963 3204
## 391 3628.8853 4075
## 127 4697.7399 4714
## 462 5752.7083 6460
## 6 1852.4679 1606
## 279 4838.6921 4765
## 60 2195.0317 1851
## 648 5659.0598 6392
## 451 5520.5617 5558
## 179 4987.1026 4648
## 410 3726.7161 3922
## 282 4855.5788 5511
## 190 4574.9023 5336
## 17 -26.2820 1000
## 216 3925.0026 4576
## 207 4657.0306 4590
## 398 3728.2081 3761
## 407 1928.4787 2169
## 613 7166.4951 6864
## 378 3413.6381 3214
## 141 5041.9030 5805
## 592 6782.9130 6784
## 720 4913.3419 4128
## 310 3257.3918 3649
## 196 4342.4285 5538
## 582 7072.1783 6824
## 33 613.8389 1526
## 180 4828.6704 5225
## 139 3794.5784 4575
## 290 4864.1825 4570
## 75 1853.6084 2192
## 485 5873.6381 6304
## 280 5304.8178 4985
## 286 4216.4535 2913
## 527 7193.6850 6598
## 540 7106.5521 7458
## 231 4156.5917 4153
## 252 4402.9273 3544
## 411 4350.3431 4169
## 674 5258.1421 5107
## 629 6983.6091 7720
## 51 1737.5445 1812
## 259 4015.5856 4760
## 234 4521.3410 4758
#R squared (correlation of predicted and actual values)
r_base <- R2(pred_base, valid$count)
#RMSE (avg diff of predicted and actual values)
rmse_base <- RMSE(pred_base,valid$count)
base_eval <- c(r_base,rmse_base)
#stepwise to find optimum variables for modelling
fit_log_step <- step(fit_base)## Start: AIC=6823.63
## count ~ season + year + month + holiday + weekday + workingday +
## weathersit + temp + atemp + humid + windspd
##
##
## Step: AIC=6823.63
## count ~ season + year + month + holiday + weekday + weathersit +
## temp + atemp + humid + windspd
##
## Df Sum of Sq RSS AIC
## - weekday 6 6356379 294873883 6822.8
## - atemp 1 698564 289216068 6822.9
## <none> 288517504 6823.6
## - temp 1 2617350 291134854 6826.2
## - holiday 1 5396496 293914000 6831.1
## - humid 1 11952136 300469640 6842.4
## - windspd 1 23596949 312114454 6861.8
## - weathersit 1 24635902 313153407 6863.5
## - season 3 33491030 322008534 6873.8
## - month 11 46586296 335103800 6878.1
## - year 1 511058147 799575652 7342.5
##
## Step: AIC=6822.77
## count ~ season + year + month + holiday + weathersit + temp +
## atemp + humid + windspd
##
## Df Sum of Sq RSS AIC
## - atemp 1 391331 295265215 6821.4
## <none> 294873883 6822.8
## - temp 1 3533858 298407742 6826.9
## - holiday 1 6821362 301695245 6832.5
## - humid 1 13858639 308732522 6844.2
## - weathersit 1 22455278 317329161 6858.3
## - windspd 1 23938053 318811936 6860.7
## - season 3 32383455 327257339 6870.0
## - month 11 45693105 340566989 6874.4
## - year 1 517941045 812814928 7338.9
##
## Step: AIC=6821.44
## count ~ season + year + month + holiday + weathersit + temp +
## humid + windspd
##
## Df Sum of Sq RSS AIC
## <none> 295265215 6821.4
## - holiday 1 7013037 302278252 6831.4
## - humid 1 13615439 308880654 6842.5
## - weathersit 1 22937032 318202247 6857.7
## - windspd 1 25860063 321125278 6862.3
## - season 3 32536878 327802093 6868.9
## - month 11 45446975 340712190 6872.6
## - temp 1 58199721 353464936 6911.4
## - year 1 517570698 812835913 7336.9
## lm(formula = count ~ season + year + month + holiday + weathersit +
## temp + humid + windspd, data = train)
#model 1
fit_log <- lm(count ~ season + year + month + holiday + workingday + weathersit +
temp + windspd, train)
#prediction
pred_log <- predict(fit_log, valid)
#model evaluation
r_log <- R2(pred_log, valid$count)
rmse_log <- RMSE(pred_log, valid$count)
log_eval <- c(r_log,rmse_log)
#model 2
fit_log2 <- lm(count ~ season + weekday + weathersit + atemp + humid + windspd, train)
#prediction
pred_log2 <- predict(fit_log2, valid)
#model evaluation
data.frame(pred_log2,valid$count)## pred_log2 valid.count
## 245 5101.0470 4727
## 195 5686.9942 5084
## 580 6118.0724 7261
## 311 4916.8266 4035
## 695 4207.6974 2424
## 611 4869.7204 5810
## 723 2751.4358 1787
## 204 7351.6995 3285
## 615 4860.8709 6203
## 199 5691.4309 4458
## 319 5056.3125 4195
## 418 3628.2591 4773
## 429 2602.3634 3423
## 113 3482.0761 4036
## 41 2077.6811 1538
## 312 5207.8404 4205
## 697 3032.9026 3959
## 317 4966.0968 3717
## 517 6310.6621 7338
## 203 7634.9503 3387
## 708 4030.3045 5582
## 601 5855.4228 7765
## 702 3628.5019 4649
## 585 5608.6594 7273
## 57 2988.1085 1969
## 315 4585.6939 3368
## 154 6518.6755 5312
## 668 2863.2262 22
## 384 2206.8559 3292
## 149 4996.3244 4788
## 294 5419.6967 4304
## 151 6958.9713 3982
## 706 4240.3491 5375
## 715 4789.8449 5047
## 284 5215.2373 4563
## 479 2272.2913 3214
## 452 4709.6206 5102
## 242 5696.8779 5204
## 70 2494.2979 1977
## 29 2332.5656 1098
## 549 6319.4308 6227
## 43 2556.3767 1472
## 505 6486.3920 8294
## 455 4403.7303 5459
## 678 4659.5450 5315
## 134 4111.3374 3409
## 275 3244.7327 2918
## 152 6200.6768 3974
## 96 4606.3944 2808
## 681 4937.2431 6852
## 10 1859.0577 1321
## 602 5969.6412 7582
## 217 5987.1976 4866
## 103 3481.3361 2162
## 493 4441.1674 6273
## 266 5144.6691 2395
## 414 3501.9325 4318
## 365 3584.1812 2485
## 511 6151.6494 6734
## 514 6044.1849 6043
## 500 4325.9660 2843
## 525 6577.6801 7736
## 484 4448.0175 4220
## 369 1872.5322 2368
## 63 2364.1726 1944
## 575 6180.1587 6685
## 109 4597.1009 3204
## 391 2774.5991 4075
## 127 5573.9838 4714
## 462 4877.6032 6460
## 6 2954.7653 1606
## 279 5664.6255 4765
## 60 2698.5193 1851
## 648 4396.3379 6392
## 451 4153.5475 5558
## 179 6083.1296 4648
## 410 3133.5605 3922
## 282 5720.8821 5511
## 190 5913.3252 5336
## 17 1581.2976 1000
## 216 4865.4916 4576
## 207 6158.6192 4590
## 398 2912.8767 3761
## 407 707.7338 2169
## 613 5358.5883 6864
## 378 2359.7311 3214
## 141 5912.5872 5805
## 592 5731.1775 6784
## 720 4087.7696 4128
## 310 4441.9067 3649
## 196 5874.3195 5538
## 582 6180.4389 6824
## 33 1535.2159 1526
## 180 5913.9352 5225
## 139 4432.7771 4575
## 290 5742.4267 4570
## 75 2566.2879 2192
## 485 4827.7845 6304
## 280 6450.1790 4985
## 286 4962.8655 2913
## 527 6417.0133 6598
## 540 5993.1204 7458
## 231 5317.1377 4153
## 252 4505.0647 3544
## 411 3638.8774 4169
## 674 4432.6388 5107
## 629 5029.6588 7720
## 51 2831.5921 1812
## 259 4365.3394 4760
## 234 5491.5700 4758
Random forest trained with all inputs as results are not affected by multi-collinearity.
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 511 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 511 -none- numeric
## importance 28 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 511 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 28 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
After taking a look at all the predicted values of the 3 models generated, a combination of the 3 models seem to improve the model even further. The predicted values of all 3 models are added up, and the mean is taken.
#create dataframe of all predictions with actual target variable (this becomes train set for ensemble)
predDF <- data.frame(pred_log, pred_log2, pred_rf, count=valid$count)
#Simple Average
simple_avg <- predDF %>%
mutate(avg_count = (pred_log+pred_log2+pred_rf)/3)
#predictions of all learners on test set (this becomes test set for ensemble)
testpred_log <- predict(fit_log, test)
testpred_log2 <- predict(fit_log2, test)
testpred_rf <- predict(fit_rf, test)
testpredDF <- data.frame(testpred_log, testpred_log2, testpred_rf, count=test$count)
#prediction
combpred <- testpredDF %>%
mutate(avg_count = (testpred_log+testpred_log2+testpred_rf)/3)
data.frame(combpred$avg_count,testpredDF$count)## combpred.avg_count testpredDF.count
## 1 2074.882 1600
## 2 1698.836 1927
## 3 1815.024 1096
## 4 3919.159 2927
## 5 1192.803 1107
## 6 2073.535 1685
## 7 2859.824 2133
## 8 2391.030 623
## 9 2872.640 2046
## 10 2399.426 1693
## 11 2717.156 1536
## 12 2625.201 1795
## 13 2502.175 2455
## 14 3667.243 3126
## 15 4744.281 3944
## 16 4565.173 4608
## 17 5023.174 4182
## 18 4520.635 4864
## 19 4145.239 4274
## 20 5011.919 4758
## 21 5411.871 4548
## 22 5700.599 4401
## 23 4950.781 5020
## 24 4280.183 4835
## 25 5278.944 5202
## 26 4715.699 4708
## 27 5534.900 5362
## 28 4247.892 4040
## 29 4787.763 3574
## 30 3861.135 3820
## 31 4676.038 4338
## 32 5138.503 5895
## 33 5163.571 4661
## 34 5106.529 4484
## 35 2566.307 2710
## 36 4935.359 5345
## 37 5215.693 4785
## 38 3255.182 2416
## 39 4528.447 3644
## 40 4606.360 4308
## 41 4189.373 4186
## 42 3838.426 3926
## 43 3075.469 3053
## 44 4260.287 2792
## 45 4079.546 3071
## 46 3625.516 3727
## 47 3959.692 3620
## 48 3652.589 3740
## 49 3532.465 3577
## 50 2712.614 3068
## 51 1711.346 1162
## 52 2088.254 2302
## 53 3663.788 3425
## 54 2608.211 2177
## 55 2738.140 2493
## 56 2425.748 2311
## 57 2751.424 2802
## 58 1894.750 1529
## 59 2666.665 2689
## 60 3058.675 3129
## 61 3860.078 4322
## 62 4027.260 4066
## 63 4071.945 4916
## 64 3935.190 4911
## 65 5049.974 5298
## 66 5947.711 6398
## 67 6212.993 6565
## 68 5927.017 6624
## 69 5179.156 5633
## 70 5165.953 5026
## 71 5500.520 6233
## 72 5084.455 5572
## 73 6032.964 6169
## 74 6719.505 7429
## 75 6409.833 6118
## 76 6018.747 5260
## 77 6401.092 6591
## 78 5783.529 4127
## 79 5626.545 7001
## 80 6762.934 6825
## 81 6852.766 6879
## 82 6287.605 6031
## 83 6620.359 6591
## 84 4148.225 4459
## 85 6457.985 7592
## 86 6743.655 8173
## 87 6225.134 5464
## 88 6759.493 7605
## 89 5672.157 6530
## 90 5493.805 6053
## 91 6809.111 7713
## 92 6719.588 6140
## 93 6520.428 7525
## 94 6879.013 8009
## 95 4776.544 4073
## 96 6662.653 8167
## 97 7081.165 8555
## 98 7239.723 8156
## 99 6444.839 7965
## 100 6270.561 7570
## 101 7111.833 7693
## 102 5995.340 7444
## 103 5883.374 7852
## 104 4622.118 5138
## 105 4807.794 5499
## 106 5666.270 3910
## 107 5589.277 5729
## 108 2545.245 1013
## 109 1990.668 2114
## 110 2416.977 1796
#model evaluation
r_stack <- R2(combpred$avg_count, test$count)
rmse_stack <- RMSE(combpred$avg_count, test$count)
stack_eval <- c(r_stack,rmse_stack)
evalDF <- data.frame(rbind(base_eval,log_eval,log2_eval,rf_eval,stack_eval))
colnames(evalDF) <- c("R.squared","RMSE")
row.names(evalDF) <- c("Base", "Log", "Log2", "RF", "Simple_Avg")
evalDF %>%
group_by(R.squared) %>%
ggplot(aes(x=row.names(evalDF),y=R.squared))+
geom_col(aes(fill=row.names(evalDF)))+
labs(title="R squared of all models", x="Model", fill="Models")evalDF %>%
ggplot(aes(x=row.names(evalDF), y=RMSE))+
geom_col(aes(fill=row.names(evalDF)))+
labs(title="RMSE of all models", x="Model", fill="Models")Residuals vs Fitted: shows a random pattern
Normal Q-Q: the residuals fairly follow the straight line and are normally distributed, so we can assume normality
Scale-Location: The residuals look randomly spread out along the line.
Residuals vs Leverage: All cases are within boundaries of Cook’s distance, hence no outliers which are influential to the regression results.
By accurately predicting the number of bike rentals per day, companies can anticipate the bike volumes required and allocate resources efficiently to manage their bike fleet according to user demand.